{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Timeseries Forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explains how to use `tsfresh` in time series foreacasting.\n",
    "Make sure you also read through the [documentation](https://tsfresh.readthedocs.io/en/latest/text/forecasting.html) to learn more on this feature.\n",
    "\n",
    "We will use the stock price of Apple for this.\n",
    "In this notebook we will only showcase how to work with a single time series at a time (one stock).\n",
    "There exist another notebook in the `advanced` folder, which treats several stocks at the same time.\n",
    "Basically the same - but a bit more complex when it comes to pandas multi-indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pylab as plt\n",
    "\n",
    "from tsfresh import extract_features, select_features\n",
    "from tsfresh.utilities.dataframe_functions import roll_time_series, make_forecasting_frame\n",
    "from tsfresh.utilities.dataframe_functions import impute\n",
    "\n",
    "try:\n",
    "    import pandas_datareader.data as web\n",
    "except ImportError:\n",
    "    print(\"You need to install the pandas_datareader. Run pip install pandas_datareader.\")\n",
    "\n",
    "from sklearn.ensemble import AdaBoostRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We download the data from \"stooq\" and only store the High value.\n",
    "Please note: this notebook is for showcasing `tsfresh`s feature extraction - not to predict stock market prices :-)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = web.DataReader(\"AAPL\", 'stooq')[\"High\"]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15, 6))\n",
    "df.plot(ax=plt.gca())\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to make the time dependency a bit clearer and add an identifier to each of the stock values (in this notebook we only have Google though)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_melted = pd.DataFrame({\"high\": df.copy()})\n",
    "df_melted[\"date\"] = df_melted.index\n",
    "df_melted[\"Symbols\"] = \"AAPL\"\n",
    "\n",
    "df_melted.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create training data sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Forecasting typically involves the following steps:\n",
    "* take all data up to today\n",
    "* do feature extraction (e.g. by running `extract_features`)\n",
    "* run a prediction model (e.g. a regressor, see below)\n",
    "* us the result as the forecast for tomorrow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In training however, we need multiple examples to train.\n",
    "If we would only use the time series until today (and wait for the value of tomorrow to have a target), we would only have a single training example.\n",
    "Therefore we use a trick: we replay the history.\n",
    "\n",
    "Imagine you have a cut-out window sliding over your data.\n",
    "At each time step $t$, you treat the data as it would be today. \n",
    "You extract the features with everything you know until today (which is all data until and including $t$).\n",
    "The target for the features until time $t$ is the time value of time $t + 1$ (which you already know, because everything has already happened).\n",
    "\n",
    "The process of window-sliding is implemented in the function `roll_time_series`.\n",
    "Our window size will be 20 (we look at max 20 days in the past) and we disregard all windows which are shorter than 5 days."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rolled = roll_time_series(df_melted, column_id=\"Symbols\", column_sort=\"date\",\n",
    "                             max_timeshift=20, min_timeshift=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rolled.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting dataframe now consists of these \"windows\" stamped out of the original dataframe.\n",
    "For example all data with the `id = (AAPL, 2015-07-15 00:00:00)` comes from the original data of stock `AAPL` including the last 20 days until `2015-07-15`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rolled[df_rolled[\"id\"] == (\"AAPL\", pd.to_datetime(\"2015-07-15\"))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_melted[(df_melted[\"date\"] <= pd.to_datetime(\"2015-07-15\")) & \n",
    "          (df_melted[\"date\"] >= pd.to_datetime(\"2015-06-16\")) & \n",
    "          (df_melted[\"Symbols\"] == \"AAPL\")]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you now group by the new `id` column, each of the groups will be a certain stock symbol until and including the data until a certain day (and including the last 20 days in the past).\n",
    "\n",
    "Whereas we started with 1259 data samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(df_melted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we now have 1254 unique windows (identified by stock symbol and ending date):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rolled[\"id\"].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We \"lost\" 5 windows, as we required to have a minimum history of more than 5 days."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rolled.groupby(\"id\").size().agg([np.min, np.max])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The process is also shown in this image (please note that the window size is smaller for better visibility):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./stocks.png\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extract Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rolled (windowed) data sample is now in the correct format to use it for `tsfresh`s feature extraction.\n",
    "As normal, features will be extracted using all data for a given `id`, which is in our case all data of a given window and a given id (one colored box in the graph above).\n",
    "\n",
    "If the feature extraction will return a row with the index `(AAPL, 2015-07-15 00:00:00)`, you now it has been calculated using the `AAPL` data up and including `2015-07-15` (and 20 days of history)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "X = extract_features(df_rolled.drop(\"Symbols\", axis=1), \n",
    "                     column_id=\"id\", column_sort=\"date\", column_value=\"high\", \n",
    "                     impute_function=impute, show_warnings=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We make the data a bit easier to work with by removing the tuple-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = X.set_index(X.index.map(lambda x: x[1]), drop=True)\n",
    "X.index.name = \"last_date\"\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our `(AAPL, 2015-07-15 00:00:00)` is also in the data again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.loc['2015-07-15']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just to repeat: the features in this row were only calculated using the time series values of `AAPL` up to and including `2015-07-15` and the last 20 days."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the extracted features to train a regressor.\n",
    "But what will be our targets?\n",
    "The target for the row `2015-07-15` is the value on the next timestep (that would be `2015-07-16` in this case).\n",
    "\n",
    "So all we need to do is go back to our original dataframe and take the stock value of tomorrow.\n",
    "This is done with `shift`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df_melted.set_index(\"date\").sort_index().high.shift(-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quick consistency test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y[\"2015-07-15\"], df[\"2015-07-16\"].iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target for `2015-07-15` is to forecast the value of the next day, `2015-07-16`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, we need to be a bit careful here: `X` is missing the first 5 dates (as our minimum window size was 5) and `y` is missing the last date (as there is nothing to predict on today).\n",
    "So lets make sure we have a consistent view on the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y[y.index.isin(X.index)]\n",
    "X = X[X.index.isin(y.index)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now train normal AdaBoostRegressors to predict the next time step .\n",
    "Let's split the data into a training and testing sample (but make sure to keep temporal consistency).\n",
    "We take everything until 2019 as train data an the rest as test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X[:\"2018\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X[:\"2018\"]\n",
    "X_test = X[\"2019\":]\n",
    "\n",
    "y_train = y[:\"2018\"]\n",
    "y_test = y[\"2019\":]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and do feature selection before training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_selected = select_features(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ada = AdaBoostRegressor()\n",
    "\n",
    "ada.fit(X_train_selected, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets check how good our prediction is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test_selected = X_test[X_train_selected.columns]\n",
    "\n",
    "y_pred = pd.Series(ada.predict(X_test_selected), index=X_test_selected.index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prediction is for the next day, so for drawing we need to shift 1 step back:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15, 6))\n",
    "\n",
    "y.plot(ax=plt.gca())\n",
    "y_pred.shift(-1).plot(ax=plt.gca(), legend=None, marker=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, clearly not perfect ;-)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
